Abstract

We present a new, simple, and efficient approach for computing the Lempel-Ziv (LZ77)
factorization of a string in linear time, based on suffix arrays. Computational experiments
on various data sets show that our approach consistently outperforms the fastest previous
algorithm LZ_OG (Ohlebusch and Gog 2011), and can be up to 2 to 3 times faster in the
processing after obtaining the suffix array, while requiring the same or a little more space.

1 Introduction

The LZ77 factorization [1] of a string captures important properties concerning
repeated occurrences of substrings in the string, and has obvious applications in the field
of data compression, as well as being the key component of various efficient algorithms
on strings [2, 3]. Consequently, many algorithms for its efficient calculation have been
proposed. The LZ77 factorization of a string $S$ is a factorization $S = f_1 \cdots f_n$ where each
factor $f_k$ is either (1) a single character if that character does not occur in $f_1 \cdots f_{k-1}$, or,
(2) the longest prefix of the rest of the string which occurs at least twice in $f_1 \cdots f_k$.

A naive algorithm that computes the longest common prefix with each of the $O(N)$
previous positions only requires $O(1)$ working space (excluding the output), but can take
$O(N^2)$ time, where $N$ is the length of the string. Using string indices such as suffix
trees [4] and on-line algorithms to construct them [5], the LZ factorization can be computed
in an on-line manner in $O(N \log |\Sigma|)$ time and $O(N)$ space, where $|\Sigma|$ is the size of the
alphabet.

Most recent efficient linear time algorithms are off-line, running in $O(N)$ time for
integer alphabets using $O(N)$ space (see Table 1). They first construct the suffix array [6]
of the string, and compute an array called the Longest Previous Factor (LPF) array from
which the LZ factorization can be easily computed [7, 8, 9, 10, 11]. Many algorithms of
this family first compute the longest common prefix (LCP) array prior to the computation
of the LPF array. However, the computation of the LCP array is also costly. The
algorithm CI1 (COMPUTE_LPF) of [12] and the algorithm LZ_OG [10] cleverly avoid its
computation and directly compute the LPF array.

An important observation here is that the LPF array is actually more information than is
required for the computation of the LZ factorization, i.e., if our objective is the LZ
factorization, we only use a subset of the entries in LPF. However, the above algorithms focus
on computing the entire LPF array, perhaps since it is difficult to determine beforehand
which entries of LPF are actually required. Although some algorithms such as a variant of
CPS1 [11] or CPS2 in [8] avoid computation of LPF, they either require the LCP array, or
do not run in linear worst case time and are not as efficient. (See [11] for a survey.)

Table 1. Fast Linear time LZ-Factorization Algorithms based on Suffix Arrays

In this paper, we propose a new approach to avoid the computation of LCP and LPF 
arrays altogether, by combining the ideas of the naive algorithm with those of CI1 and 
LZ_OG, and still achieve worst case linear time. The resulting algorithm is surprisingly 
both simple and efficient. 

Computational experiments on various data sets show that our algorithm consistently
outperforms LZ_OG [10], and can be up to 2 to 3 times faster in the processing after
obtaining the suffix array, while requiring the same or a little more space.

Although our algorithm might be considered as a simple combination of ideas appearing
in previous works, this paper is one of the first to propose, implement and evaluate this
combination. We note that algorithms that avoid the computation of LCP and LPF based
on ideas similar to those in this paper were developed independently and almost simultaneously
by Kempa and Puglisi [13] and Kärkkäinen et al. [14]. Since we did not have knowledge
of their work until very recently, we have not made comparisons with them. The worst
case time complexity of [13] is not independent of the alphabet size, but the algorithm is
fast and space efficient. In the more recent manuscript [14], two new linear time algorithms
which outperform all previous algorithms (including ours) in terms of time and space are
proposed, confirming the potential of this approach.



2 Preliminaries 



Let $\mathcal{N}$ be the set of non-negative integers. Let $\Sigma$ be a finite alphabet. An element of
$\Sigma^*$ is called a string. The length of a string $T$ is denoted by $|T|$. The empty string $\varepsilon$ is the
string of length 0, namely, $|\varepsilon| = 0$. Let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. For a string $T = XYZ$, $X$, $Y$
and $Z$ are called a prefix, substring, and suffix of $T$, respectively. The set of prefixes of $T$
is denoted by $prefix(T)$. The longest common prefix of strings $X, Y$, denoted $lcp(X, Y)$,
is the longest string in $prefix(X) \cap prefix(Y)$.

The $i$-th character of a string $T$ is denoted by $T[i]$ for $1 \le i \le |T|$, and the substring
of a string $T$ that begins at position $i$ and ends at position $j$ is denoted by $T[i..j]$
(also written $T[i : j]$) for $1 \le i \le j \le |T|$. For convenience, let $T[i..j] = \varepsilon$ if $j < i$,
and $T[|T| + 1] = \$$ where \$ is a special delimiter character that does not occur elsewhere
in the string.

2.1 Suffix Arrays 

The suffix array [6] $SA$ of any string $T$ is an array of length $|T|$ such that for any
$1 \le i \le |T|$, $SA[i] = j$ indicates that $T[j : |T|]$ is the $i$-th lexicographically smallest suffix of
$T$. For convenience, assume that $SA[0] = |T| + 1$. The inverse array $SA^{-1}$ of $SA$ is an
array of length $|T|$ such that $SA^{-1}[SA[i]] = i$. As in [13], let $\Phi$ be an array of length $|T|$
such that $\Phi[SA[1]] = |T|$ and $\Phi[SA[i]] = SA[i - 1]$ for $2 \le i \le |T|$, i.e., for any suffix
$j = SA[i]$, $\Phi[j] = SA[i - 1]$ is the immediately preceding suffix in the suffix array. The
suffix array $SA$ for any string of length $|T|$ can be constructed in $O(|T|)$ time regardless of
the alphabet size, assuming an integer alphabet (e.g. [16]). All our algorithms will assume
that $SA$ is already computed. Given $SA$, arrays $SA^{-1}$ and $\Phi$ can easily be computed in
linear time by a simple scan.
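
The single scan mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `suffix_array` and `inverse_and_phi` are hypothetical, and the suffix array is built by plain sorting, which is not the linear time construction assumed in the text.

```python
def suffix_array(text):
    """1-based suffix array built by plain sorting (illustration only, not linear time)."""
    n = len(text)
    order = sorted(range(1, n + 1), key=lambda j: text[j - 1:])
    return [n + 1] + order              # SA[0] = |T| + 1, following the convention above


def inverse_and_phi(sa):
    """One scan over SA fills both the inverse array and Phi."""
    n = len(sa) - 1
    isa = [0] * (n + 1)                 # isa[j] = rank of suffix j (index 0 unused)
    phi = [0] * (n + 1)                 # phi[j] = suffix preceding suffix j in SA
    for i in range(1, n + 1):
        isa[sa[i]] = i
        phi[sa[i]] = sa[i - 1] if i >= 2 else n   # Phi[SA[1]] = |T| as defined above
    return isa, phi


sa = suffix_array("banana")             # [7, 6, 4, 2, 1, 5, 3]
isa, phi = inverse_and_phi(sa)
```

For "banana" the sorted suffixes start at positions 6, 4, 2, 1, 5, 3, so `isa[1:]` is `[4, 3, 6, 2, 5, 1]` and `phi[1:]` is `[2, 4, 5, 6, 1, 6]`.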

2.2 LZ Encodings 

LZ encodings are dynamic dictionary based encodings with many variants. The variant
we consider is also known as the s-factorization [11].

Definition 1 (LZ77-factorization) The s-factorization of a string $T$ is the factorization
$T = f_1 \cdots f_n$ where each s-factor $f_k \in \Sigma^+$ ($k = 1, \ldots, n$) is defined inductively as
follows: $f_1 = T[1]$. For $k \ge 2$: if $T[|f_1 \cdots f_{k-1}| + 1] = c \in \Sigma$ does not occur in
$f_1 \cdots f_{k-1}$, then $f_k = c$. Otherwise, $f_k$ is the longest prefix of $f_k \cdots f_n$ that occurs at least
twice in $f_1 \cdots f_k$.

Note that each LZ factor can be represented in constant space, i.e., a pair of integers where
the first and second elements respectively represent the length and position of a previous
occurrence of the factor. If the factor is a new character and the length of its previous
occurrence is 0, the second element will encode the new character instead of the position.
For example, the s-factorization of the string $T$ = abaabababaaaaabbabab is a, b, a, aba,
baba, aaaa, b, babab. This can be represented as (0, a), (0, b), (1, 1), (3, 1), (4, 5), (4, 10),
(1, 2), (5, 5).

We define two functions $LPF$ and $PrevOcc$ below. For any $1 \le i \le N$, $LPF(i)$ is the
maximum length of the longest common prefix between $T[i : N]$ and $T[j : N]$ for any $1 \le j < i$,
and $PrevOcc(i)$ is a position $j$ which achieves $LPF(i)$. More precisely,

$$LPF(i) = \max(\{0\} \cup \{lcp(T[i : N], T[j : N]) \mid 1 \le j < i\})$$

and

$$PrevOcc(i) = j \quad \text{if } LPF(i) > 0,$$

where $j$ satisfies $1 \le j < i$, and $T[i : i + LPF(i) - 1] = T[j : j + LPF(i) - 1]$. There can
be multiple choices of $j$, but here, it suffices to fix one. Let
$p_k = |f_1 \cdots f_{k-1}| + 1$. Then, $f_k$ can be represented as a pair $(LPF(p_k), PrevOcc(p_k))$ if
$LPF(p_k) > 0$, and $(0, T[p_k])$ otherwise.

Most recent fast linear time algorithms for computing the LZ factorization calculate
$LPF$ and $PrevOcc$ for all positions $1 \le i \le N$ of the text and store the values in an array,
and then use these values as in Algorithm 1 to output the LZ factorization.

Algorithm 1: LZ Factorization from LPF and PrevOcc arrays
Input : String T, LPF, PrevOcc
p ← 1;
while p ≤ N do
    if LPF[p] = 0 then Output: (0, T[p])
    else Output: (LPF[p], PrevOcc[p])
    p ← p + max(1, LPF[p]);
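
To make the definitions concrete, the following sketch computes LPF and PrevOcc directly from their definitions and then walks the text as in Algorithm 1. The function names are hypothetical, the LPF computation is deliberately brute force, and where several positions $j$ qualify the sketch keeps the smallest one (an arbitrary choice).

```python
def lcp_len(text, i, j):
    """Length of the longest common prefix of T[i:N] and T[j:N] (1-based i, j)."""
    n = len(text)
    l = 0
    while i + l <= n and j + l <= n and text[i + l - 1] == text[j + l - 1]:
        l += 1
    return l


def lpf_prevocc(text):
    """LPF and PrevOcc straight from the definitions (brute force, illustration only)."""
    n = len(text)
    lpf = [0] * (n + 1)
    prev = [0] * (n + 1)                # 0 is a placeholder where LPF is 0
    for i in range(1, n + 1):
        for j in range(1, i):
            l = lcp_len(text, i, j)
            if l > lpf[i]:              # strict test keeps the smallest qualifying j
                lpf[i], prev[i] = l, j
    return lpf, prev


def factorize(text, lpf, prev):
    """Algorithm 1: emit one (length, position-or-character) pair per factor."""
    n = len(text)
    factors = []
    p = 1
    while p <= n:
        if lpf[p] == 0:
            factors.append((0, text[p - 1]))
        else:
            factors.append((lpf[p], prev[p]))
        p += max(1, lpf[p])
    return factors


T = "abaabababaaaaabbabab"
factors = factorize(T, *lpf_prevocc(T))
# [(0, 'a'), (0, 'b'), (1, 1), (3, 1), (4, 5), (4, 10), (1, 2), (5, 5)]
```

The result agrees with the pairs listed for the example string above.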

3 Algorithm 

We first describe the naive algorithm for calculating the LZ factorization of a string, and
analyze its time complexity. The naive algorithm does not compute all values of $LPF$ and
$PrevOcc$ as explicit arrays, but only the values required to represent each factor. The
procedure is shown in Algorithm 2. For a factor starting at position $p$, the algorithm
computes $LPF(p)$ and $PrevOcc(p)$ by simply looking at each of its $p - 1$ previous
positions, and naively computes the longest common prefix (lcp) between each previous
suffix and the suffix starting at position $p$, and outputs the factor accordingly. At first
glance, this algorithm looks like an $O(N^3)$ time algorithm since there are 3 nested loops.
However, the total time can be bounded by $O(N^2)$, since the total length of the longest
lcp's found for each $p$ in the algorithm, i.e., the total length of the LZ factors found, is
$N$. More precisely, let the LZ factorization of string $T$ of length $N$ be $f_1 \cdots f_n$, and
$p_k = |f_1 \cdots f_{k-1}| + 1$ as before. Then, the number of character comparisons executed in
Line 6 of Algorithm 2 when calculating $f_k$ is at most $(p_k - 1)(|f_k| + 1)$, and the total can
be bounded: $\sum_{k=1}^{n} (p_k - 1)(|f_k| + 1) \le N \sum_{k=1}^{n} (|f_k| + 1) = O(N^2)$. An important
observation here is that if we can somehow reduce the number of previous candidate positions for
naively computing lcp's (i.e. the choice of $j$ in Line 4 of Algorithm 2) from $O(N)$ to $O(1)$
positions, this would result in an $O(N)$ time algorithm. This very simple observation is the
first key to the linear running times of our new algorithms.
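
The counting argument can be checked on a small input. The sketch below is one possible rendering of the naive procedure with a counter on the character comparisons; the name `naive_lz` and the choice of `"$"` as the delimiter are assumptions made for the illustration.

```python
def naive_lz(text, sentinel="$"):
    """Naive factorization that also counts the character comparisons it performs."""
    assert sentinel not in text
    n = len(text)
    t = " " + text + sentinel           # t[1..n] is the text, t[n+1] the delimiter
    factors, comparisons = [], 0
    p = 1
    while p <= n:
        lpf, prev = 0, 0
        for j in range(1, p):           # every previous position is a candidate
            l = 0
            while True:
                comparisons += 1
                if t[j + l] != t[p + l]:   # the delimiter guarantees a mismatch
                    break
                l += 1
            if l > lpf:
                lpf, prev = l, j
        factors.append((lpf, prev) if lpf > 0 else (0, t[p]))
        p += max(1, lpf)
    return factors, comparisons


T = "abaabababaaaaabbabab"
factors, comparisons = naive_lz(T)
bound = len(T) * (len(T) + len(factors))   # N * sum over k of (|f_k| + 1)
```

Since each candidate $j$ costs at most $|f_k| + 1$ comparisons, `comparisons` never exceeds `bound`; shrinking the candidate set to a constant number of positions removes the factor $N$.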

Algorithm 2: Naive Algorithm for Calculating LZ factorization
Input : String T
 1 p ← 1;
 2 while p ≤ |T| do
 3     LPF ← 0;
 4     for j ← 1, . . . , p − 1 do
 5         l ← 0;
 6         while T[j + l] = T[p + l] do l ← l + 1;    // l ← lcp(T[j : N], T[p : N])
 7         if l > LPF then LPF ← l; PrevOcc ← j;
 8     if LPF > 0 then Output: (LPF, PrevOcc)
 9     else Output: (0, T[p])
10     p ← p + max(1, LPF);

To accomplish this, our algorithm utilizes yet another simple but key observation made
in [12]. Since suffixes in the suffix array are lexicographically sorted, if we fix a suffix
$SA[i]$ in the suffix array, we know that suffixes appearing closer in the suffix array will
have longer longest common prefixes with suffix $SA[i]$.

For any position $1 \le i \le N$ of the suffix array, let

$$PSV_{lex}[i] = \max(\{0\} \cup \{1 \le j < i \mid SA[j] < SA[i]\})$$
$$NSV_{lex}[i] = \min(\{N \ge j > i \mid SA[j] < SA[i]\}), \text{ or } 0 \text{ if no such } j \text{ exists},$$

i.e., for the suffix starting at text position $SA[i]$, the values $PSV_{lex}[i]$ and $NSV_{lex}[i]$
represent the lexicographic rank of the suffixes that start before it in the string and are
lexicographically closest (previous and next) to it, or 0 if such a suffix does not exist. From the
above arguments, we have that for any text position $1 \le p \le N$,

$$LPF(p) = \max(lcp(T[SA[PSV_{lex}[SA^{-1}[p]]] : N], T[p : N]),\ lcp(T[SA[NSV_{lex}[SA^{-1}[p]]] : N], T[p : N])).$$

The above observation or its variant has been used as the basis for calculating $LPF(i)$
for all $1 \le i \le N$ in linear time in practically all previous linear time algorithms for LZ
factorization based on the suffix array. In [10], they consider (implicitly) the arrays in text
order rather than lexicographic order. In this case,

$$PSV_{text}[SA[i]] = SA[PSV_{lex}[i]]$$
$$NSV_{text}[SA[i]] = SA[NSV_{lex}[i]]$$

and therefore

$$LPF(p) = \max(lcp(T[PSV_{text}[p] : N], T[p : N]),\ lcp(T[NSV_{text}[p] : N], T[p : N])).$$
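
The relation between the two orders can be illustrated on "banana". In this sketch the lexicographic arrays are hard-coded from their definitions, the names `to_text_order` and `lpf_at` are hypothetical, and `"$"` stands in for the delimiter at position $N + 1$.

```python
def to_text_order(sa, psv_lex, nsv_lex):
    """Re-index PSV/NSV by text position: X_text[SA[i]] = SA[X_lex[i]]."""
    n = len(sa) - 1
    psv_text = [0] * (n + 1)
    nsv_text = [0] * (n + 1)
    for i in range(1, n + 1):
        psv_text[sa[i]] = sa[psv_lex[i]]    # rank 0 maps to SA[0] = N + 1
        nsv_text[sa[i]] = sa[nsv_lex[i]]
    return psv_text, nsv_text


def lpf_at(text, p, psv_text, nsv_text):
    """LPF(p) as the larger of the lcp lengths with the two candidate suffixes."""
    t = " " + text + "$"                    # t[N + 1] is the delimiter
    best = 0
    for j in (psv_text[p], nsv_text[p]):
        l = 0
        while t[j + l] == t[p + l]:
            l += 1
        best = max(best, l)
    return best


text = "banana"
sa = [7, 6, 4, 2, 1, 5, 3]                  # 1-based, SA[0] = N + 1
psv_lex = [0, 0, 0, 0, 0, 4, 4]
nsv_lex = [0, 2, 3, 4, 0, 6, 0]
psv_text, nsv_text = to_text_order(sa, psv_lex, nsv_lex)
lpf = [lpf_at(text, p, psv_text, nsv_text) for p in range(1, 7)]
# lpf == [0, 0, 0, 3, 2, 1]
```

A missing neighbour shows up as text position $N + 1 = 7$, whose suffix is just the delimiter and therefore contributes an lcp of 0.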

While [12] and [10] utilize this observation to compute all entries of $LPF$ in linear time,
we utilize it in a slightly different way as mentioned previously, and use it to reduce the
candidate positions for calculating $PrevOcc(i)$ (i.e. the choice of $j$ in Algorithm 2) to only
2 positions. The key idea of our approach is in the combination of the above observation
with the amortized analysis of the naive algorithm, suggesting that we can defer the
computation of the values of $LPF$ until we actually require them for the LZ factorization and still
achieve linear worst case time. If $PSV_{lex}[i]$ and $NSV_{lex}[i]$ (or $PSV_{text}[i]$ and $NSV_{text}[i]$)
are known for all $1 \le i \le N$, the linear running time of the algorithm follows from the
previous arguments. The basic structure of our algorithm is shown in Algorithm 3 when
using $PSV_{lex}$ and $NSV_{lex}$. Note that it is easy to replace them with $PSV_{text}$ and $NSV_{text}$,
and in such case, $SA$ and $SA^{-1}$ are not necessary once we have $PSV_{text}$ and $NSV_{text}$.
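
A minimal sketch of this structure, restricting the naive search to the two candidate positions, might look as follows. It is not the authors' implementation: the suffix array is built by sorting and the PSV/NSV values are computed straight from their definitions, so only the main loop reflects the linear time idea. The name `lz_via_psv_nsv` is hypothetical.

```python
def lz_via_psv_nsv(text, sentinel="$"):
    assert sentinel not in text
    n = len(text)
    t = " " + text + sentinel               # t[1..n] is the text, t[n+1] the delimiter
    sa = [n + 1] + sorted(range(1, n + 1), key=lambda j: text[j - 1:])
    isa = [0] * (n + 2)
    for i in range(1, n + 1):
        isa[sa[i]] = i
    # PSV_lex / NSV_lex by definition (quadratic stand-in for a linear-time method)
    psv = [0] * (n + 1)
    nsv = [0] * (n + 1)
    for i in range(1, n + 1):
        psv[i] = max([0] + [j for j in range(1, i) if sa[j] < sa[i]])
        nsv[i] = min([j for j in range(i + 1, n + 1) if sa[j] < sa[i]], default=0)
    factors = []
    p = 1
    while p <= n:
        lpf, prev = 0, 0
        r = isa[p]
        for j in (sa[psv[r]], sa[nsv[r]]):  # only two candidates per factor
            l = 0
            while t[j + l] == t[p + l]:     # the delimiter stops every comparison run
                l += 1
            if l > lpf:
                lpf, prev = l, j
        factors.append((lpf, prev) if lpf > 0 else (0, t[p]))
        p += max(1, lpf)
    return factors


T = "abaabababaaaaabbabab"
factors = lz_via_psv_nsv(T)
lengths = [f[0] for f in factors]           # [0, 0, 1, 3, 4, 4, 1, 5]
```

The factor lengths match the example in Section 2.2; the reported previous positions may differ from those listed there, since any position achieving $LPF$ is acceptable.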
What remains is how to compute $PSV_{lex}[i]$ and $NSV_{lex}[i]$, or, $PSV_{text}[i]$ and $NSV_{text}[i]$
for all $1 \le i \le N$. This can be done in several ways. We consider 3 variations.

The first is a computation of $PSV_{lex}[i]$, $NSV_{lex}[i]$ using a simple linear time scan of the
suffix array with the help of a stack. The procedure is shown in Algorithm 4. This variant
requires the text, and the arrays $SA$, $SA^{-1}$, $PSV_{lex}$, $NSV_{lex}$ and a stack. The total space
complexity is $17N + 4S_{max}$ bytes assuming that an integer occupies 4 bytes, where $S_{max}$
is the maximum size of the stack during the execution of the algorithm and can be $\Theta(N)$ in
the worst case. We will call this variant BGS.
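
The stack-based scan can be sketched as below. The function name `psv_nsv_lex` is hypothetical, and the sketch additionally records the maximum stack depth, which corresponds to $S_{max}$ in the space bound.

```python
def psv_nsv_lex(sa):
    """Stack-based scan over a 1-based suffix array (sa[0] is not inspected)."""
    n = len(sa) - 1
    psv = [0] * (n + 1)
    nsv = [0] * (n + 1)                 # entries never popped by a smaller value stay 0
    stack = []
    max_depth = 0
    for i in range(1, n + 1):
        x = sa[i]
        while stack and sa[stack[-1]] > x:
            nsv[stack.pop()] = i        # i is the next smaller value for the popped rank
        psv[i] = stack[-1] if stack else 0
        stack.append(i)
        max_depth = max(max_depth, len(stack))
    return psv, nsv, max_depth


sa = [7, 6, 4, 2, 1, 5, 3]              # suffix array of "banana", SA[0] = N + 1
psv, nsv, depth = psv_nsv_lex(sa)
# psv == [0, 0, 0, 0, 0, 4, 4], nsv == [0, 2, 3, 4, 0, 6, 0], depth == 2
```

Each rank is pushed and popped at most once, which gives the linear running time; a suffix array that is increasing in text position would keep every rank on the stack and reach the worst-case depth.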

The other two use a process called peak elimination, which is very briefly described in [12]
for lexicographic order (shown in Algorithms 5 and 6), and in [10] for text order (shown
in Algorithms 7 and 8). In peak elimination, each suffix $i$ and its lexicographically
preceding suffix $j$ ($SA^{-1}[j] + 1 = SA^{-1}[i]$) are examined in some order of $i$ (lexicographic
or text order). For simplicity, we only briefly explain the approach for text order. If
$i > j$, this means that $PSV_{text}[i] = j$, and if $i < j$, $NSV_{text}[j] = i$. When both values
of $PSV_{text}[i]$ and $NSV_{text}[i]$ are determined, $i$ is identified as a peak. Given a peak $i$, it is
possible to eliminate it, and determine the value of either $NSV_{text}[PSV_{text}[i]]$ (which will
be $NSV_{text}[i]$ if $PSV_{text}[i] > NSV_{text}[i]$) or $PSV_{text}[NSV_{text}[i]]$ (which will be $PSV_{text}[i]$
if $PSV_{text}[i] < NSV_{text}[i]$), and this process is repeated. The algorithm runs in linear
time since each position can be eliminated only once. The procedure for lexicographic
order is a bit simpler since the lexicographic order of calculation implies that $PSV_{lex}[i]$ will
always be determined before $NSV_{lex}[i]$.

The algorithm of [10] actually computes the arrays $LPF$ and $PrevOcc$ directly
without computing $PSV_{text}$ and $NSV_{text}$. The algorithm we show is actually a simplification,
deferring the computation of $LPF$ and $PrevOcc$, computing $PSV_{text}$ and $NSV_{text}$ instead.

For lexicographic order, we need the text and the arrays $SA$, $SA^{-1}$, $PSV_{lex}$, $NSV_{lex}$ and
no stack, giving an algorithm with $17N$ bytes of working space. We will call this variant
BGL. For text order, although the $\Phi$ array is introduced instead of the $SA^{-1}$ array, the
suffix array is not required after its computation. Therefore, by reusing the space of $SA$
for $PSV_{text}$, the total space complexity can be reduced to $13N$ bytes. We will call this
variant BGT. Note that although $peakElim_{lex}$ and $peakElim_{text}$ are shown as recursive
functions for simplicity, they are tail recursive and thus can be optimized as loops and will
not require extra space on the call stack.
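
For the lexicographic case, the loop form of the tail recursion might look like the sketch below. It assumes that the first branch of $peakElim_{lex}(j, i)$ assigns $PSV_{lex}[i] \leftarrow j$ and that the second assigns $NSV_{lex}[j] \leftarrow i$ before recursing on $PSV_{lex}[j]$; the function name `psv_nsv_lex_peak` is hypothetical.

```python
def psv_nsv_lex_peak(sa):
    """Peak elimination in lexicographic order, with the tail call unrolled into a loop."""
    n = len(sa) - 1
    psv = [0] * (n + 1)                 # psv[1] = 0 from the start
    nsv = [0] * (n + 1)                 # all NSV values initialised to 0
    for i in range(2, n + 1):
        j = i - 1
        # loop form of peakElim_lex(j, i)
        while j != 0 and sa[j] > sa[i]:
            nsv[j] = i                  # j is a peak: both of its values are now known
            j = psv[j]                  # continue with the next candidate
        psv[i] = j                      # j = 0, or the suffix at rank j starts earlier
    return psv, nsv


sa = [7, 6, 4, 2, 1, 5, 3]              # suffix array of "banana", SA[0] = N + 1
psv, nsv = psv_nsv_lex_peak(sa)
# psv == [0, 0, 0, 0, 0, 4, 4], nsv == [0, 2, 3, 4, 0, 6, 0]
```

The chain of `psv` links plays the role that the explicit stack plays in the stack-based variant, which is why no extra working space is needed here.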

3.1 Interleaving PSV and NSV 

Since accesses to $PSV$ and $NSV$ occur at the same or close indices, it is possible to
improve the memory locality of accesses by interleaving the values of $PSV$ and $NSV$,
maintaining them in a single array as follows. Let $PNSV$ be an array of length $2N$, and
for each position $1 \le i \le 2N$, $PNSV[i] = PSV[j]$ if $i \bmod 2 = 0$, $NSV[j]$ otherwise,
where $j = \lfloor i/2 \rfloor$. Naturally, for any $1 \le i \le N$, $PSV$ and $NSV$ can be accessed as
$PSV[i] = PNSV[2i]$ and $NSV[i] = PNSV[2i + 1]$. This interleaving can be done for
both lexicographic order and text order. We will call the variants of our algorithms that
incorporate this optimization iBGS, iBGL, and iBGT.
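
The index arithmetic of the interleaved layout is sketched below. The helper name `interleave` is hypothetical, and the list is padded so that the slots $2i$ and $2i + 1$ exist for every $1 \le i \le N$. A Python list does not reproduce the cache behaviour of a contiguous integer array, so the sketch only illustrates the addressing scheme, not the speed-up.

```python
def interleave(psv, nsv):
    """Pack 1-based PSV and NSV so that PSV[i] = pnsv[2i] and NSV[i] = pnsv[2i + 1]."""
    n = len(psv) - 1
    pnsv = [0] * (2 * n + 2)            # slots 0 and 1 are padding
    for i in range(1, n + 1):
        pnsv[2 * i] = psv[i]
        pnsv[2 * i + 1] = nsv[i]
    return pnsv


psv = [0, 0, 0, 0, 0, 4, 4]             # values for "banana"
nsv = [0, 2, 3, 4, 0, 6, 0]
pnsv = interleave(psv, nsv)
# pnsv == [0, 0, 0, 2, 0, 3, 0, 4, 0, 0, 4, 6, 4, 0]
```

Both values for a given $i$ now sit in adjacent slots, so one lookup for $PSV[i]$ typically brings $NSV[i]$ into the same cache line when the array is stored contiguously.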
Algorithm 3: Basic Structure of our Algorithms.
Input : String T
Calculate PSV_lex[i] and NSV_lex[i] for all i = 1 . . . N;
p ← 1;
while p ≤ N do
    LPF ← 0;
    for j ∈ {SA[PSV_lex[SA^{-1}[p]]], SA[NSV_lex[SA^{-1}[p]]]} do
        l ← 0;
        while T[j + l] = T[p + l] do l ← l + 1;    // l ← lcp(T[j : N], T[p : N])
        if l > LPF then LPF ← l; PrevOcc ← j;
    if LPF > 0 then Output: (LPF, PrevOcc)
    else Output: (0, T[p])
    p ← p + max(1, LPF);



Algorithm 4: Calculating PSV_lex and NSV_lex from SA
Input : Suffix array SA
Output: PSV_lex, NSV_lex
Let S be an empty stack;
for i ← 1 to N do
    x ← SA[i];
    while (not S.empty()) and (SA[S.top()] > x) do
        NSV_lex[S.top()] ← i; S.pop();
    PSV_lex[i] ← if S.empty() then 0 else S.top();
    S.push(i);
while not S.empty() do
    NSV_lex[S.top()] ← 0; S.pop();



4 Computational Experiments 

We implement and compare our algorithms with LZ_OG since it has been shown to be
the most time efficient in the experiments of [10]. We also implement a variant LZ_iOG
which incorporates the interleaving optimization for the LPF and PrevOcc arrays. We have
made the source code publicly available at http://code.google.com/p/lzbg/.

All computations were conducted on a Mac Xserve (Early 2009) with 2 x 2.93 GHz Quad 
Core Xeon processors and 24GB Memory, only utilizing a single process/thread at once. 
The programs were compiled using the GNU C++ compiler (g++) 4.2.1 with the -fast 
option for optimization. The running times are measured in seconds, starting from after 
the suffix array is built, and the average of 10 runs is reported. 

We use the data of http://www.cas.mcmaster.ca/~bill/strings/ used
in previous work. Table 2 shows running times of the algorithms, as well as some statistics
of the data set. The running time of the fastest algorithm for each data set is shown in bold.
Algorithm 5: Calculating PSV_lex and NSV_lex from SA by Peak Elimination.
Input : Suffix array SA
for i ← 1 to N do NSV_lex[i] ← 0;
PSV_lex[1] ← 0;
for i ← 2 to N do peakElim_lex(i − 1, i);



Algorithm 6: Peak Elimination peakElim_lex(j, i) in Lexicographic Order.
if j = 0 or SA[j] < SA[i] then
    PSV_lex[i] ← j;
else // j ≥ 1 and SA[j] > SA[i]
    NSV_lex[j] ← i;
    peakElim_lex(PSV_lex[j], i); // j was peak.



The fastest running time among the variants that use only $13N$ bytes is prefixed with V.

The results show that all the variants of our algorithms consistently outperform LZ_OG
and even LZ_iOG for all data tested, and in some cases can be up to 2 to 3 times faster.
We can see that iBGS is fastest when the data is not extremely repetitive, and the average
length of the factor is not so large, while iBGT is fastest for highly repetitive data.
iBGT is also the fastest when we restrict our attention to the algorithms that use only $13N$
bytes of work space.

Acknowledgments: We thank Dr. Simon Gog for sharing his implementation of LZ_OG,
and Dr. Simon Puglisi for sending us the manuscripts [13, 14].
Table 2. Running times (seconds) of algorithms and various statistics for the data set